Training Logistic Regression via Stochastic Gradient Ascent

The goal of this notebook is to implement a logistic regression classifier using stochastic gradient ascent. You will:

  • Extract features from Amazon product reviews.
  • Convert an SFrame into a NumPy array.
  • Write a function to compute the derivative of log likelihood function with respect to a single coefficient.
  • Implement stochastic gradient ascent.
  • Compare convergence of stochastic gradient ascent with that of batch gradient ascent.

Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create. Upgrade by

   pip install graphlab-create --upgrade

See this page for detailed instructions on upgrading.

In [1]:
import graphlab as gl
print('gl.version: %s' % (gl.version))
gl.canvas.set_target('ipynb')

from __future__ import division
import math
import string

# my imports
import pandas as pd
import numpy as np
from six.moves import cPickle as pickle
gl.version: 1.8.5
In [2]:
from types import MethodType
def value_counts( self ):
    import pandas as pd
    pdDf = self.to_dataframe()
    for ftr in pdDf.columns:
        print(pdDf[ftr].value_counts())
        
#SFrame.value_counts = MethodType(value_counts, None, SFrame)
#setattr(SFrame, 'value_counts', value_counts)
#setattr(glbObsAll, 'value_counts', value_counts)

def SFrameShow( self , catClmn = None, title = None):
    if (title != None):
        print('%s Summary:' % (title))
    print(self.shape)
    self.show()
    if catClmn != None:
        for clmn in catClmn:
            print('  %s category summary:' % (clmn))
            glbObsAll[clmn].show(view = 'Categorical')

    print(self)

Load and process review dataset

For this assignment, we will use the same subset of the Amazon product review dataset that we used in Module 3 assignment. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted of mostly positive reviews.

In [3]:
glbObsAll = gl.SFrame('data/amazon_baby_subset.gl/')
2016-03-20 11:23:49,313 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1458487426.log
This non-commercial license of GraphLab Create is assigned to bbalaji8@gmail.com and will expire on December 09, 2016. For commercial licensing options, visit https://dato.com/buy/.
In [4]:
SFrameShow(glbObsAll, catClmn = ['rating', 'sentiment'], title = 'glbObsAll')
glbObsAll Summary:
(53072, 4)
  rating category summary:
  sentiment category summary:
+-------------------------------+-------------------------------+--------+-----------+
|              name             |             review            | rating | sentiment |
+-------------------------------+-------------------------------+--------+-----------+
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   |     1     |
| Nature's Lullabies Second ... | We wanted to get something... |  5.0   |     1     |
| Nature's Lullabies Second ... | My daughter had her 1st ba... |  5.0   |     1     |
|  Lamaze Peekaboo, I Love You  | One of baby's first and fa... |  4.0   |     1     |
| SoftPlay Peek-A-Boo Where'... | Very cute interactive book... |  5.0   |     1     |
|   Our Baby Girl Memory Book   | Beautiful book, I love it ... |  5.0   |     1     |
| Hunnt® Falling Flowers... | Try this out for a spring ... |  5.0   |     1     |
| Blessed By Pope Benedict X... | very nice Divine Mercy Pen... |  5.0   |     1     |
| Cloth Diaper Pins Stainles... | We bought the pins as my 6... |  4.0   |     1     |
| Cloth Diaper Pins Stainles... | It has been many years sin... |  5.0   |     1     |
+-------------------------------+-------------------------------+--------+-----------+
[53072 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Just like we did previously, we will work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:

  1. Remove punctuation using Python's built-in string manipulation functionality.
  2. Compute word counts (only for the impWrd)

Refer to Module 3 assignment for more details.

In [5]:
import json
with open('data/important_words.json', 'r') as f: 
    impWrd = json.load(f)
impWrd = [str(s) for s in impWrd]
print impWrd

# Remote punctuation
def rmvPunctuation(text):
    import string
    return text.translate(None, string.punctuation) 

glbObsAll['review.Cln'] = glbObsAll['review'].apply(rmvPunctuation)

# Split out the words into individual columns
for word in impWrd:
    glbObsAll[word] = glbObsAll['review.Cln'].apply(lambda s : s.split().count(word))
['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'return', 'amazon', 'different', 'top', 'want', 'problem', 'know', 'water', 'try', 'received', 'sure', 'times', 'chair', 'find', 'hold', 'gate', 'open', 'bottom', 'away', 'actually', 'cheap', 'worked', 'getting', 'ordered', 'came', 'milk', 'bad', 'part', 'worth', 'found', 'cover', 'many', 'design', 'looking', 'weeks', 'say', 'wanted', 'look', 'place', 'purchase', 'looks', 'second', 'piece', 'box', 'pretty', 'trying', 'difficult', 'together', 'though', 'give', 'started', 'anything', 'last', 'company', 'come', 'returned', 'maybe', 'took', 'broke', 'makes', 'stay', 'instead', 'idea', 'head', 'said', 'less', 'went', 'working', 'high', 'unit', 'seems', 'picture', 'completely', 'wish', 'buying', 'babies', 'won', 'tub', 'almost', 'either']

The SFrame glbObsAll now contains one column for each of the 193 impWrd.

In [6]:
# Exclude impWrd for show
SFrameShow(glbObsAll[list(set(glbObsAll.column_names()) - set(impWrd))], 
           catClmn = ['rating', 'sentiment'], title = 'glbObsAll excl. impWrd')
glbObsAll excl. impWrd Summary:
(53072, 5)
  rating category summary:
  sentiment category summary:
+--------+-------------------------------+-------------------------------+-----------+
| rating |             review            |              name             | sentiment |
+--------+-------------------------------+-------------------------------+-----------+
|  5.0   | All of my kids have cried ... | Stop Pacifier Sucking with... |     1     |
|  5.0   | We wanted to get something... | Nature's Lullabies Second ... |     1     |
|  5.0   | My daughter had her 1st ba... | Nature's Lullabies Second ... |     1     |
|  4.0   | One of baby's first and fa... |  Lamaze Peekaboo, I Love You  |     1     |
|  5.0   | Very cute interactive book... | SoftPlay Peek-A-Boo Where'... |     1     |
|  5.0   | Beautiful book, I love it ... |   Our Baby Girl Memory Book   |     1     |
|  5.0   | Try this out for a spring ... | Hunnt® Falling Flowers... |     1     |
|  5.0   | very nice Divine Mercy Pen... | Blessed By Pope Benedict X... |     1     |
|  4.0   | We bought the pins as my 6... | Cloth Diaper Pins Stainles... |     1     |
|  5.0   | It has been many years sin... | Cloth Diaper Pins Stainles... |     1     |
+--------+-------------------------------+-------------------------------+-----------+
+-------------------------------+
|           review.Cln          |
+-------------------------------+
| All of my kids have cried ... |
| We wanted to get something... |
| My daughter had her 1st ba... |
| One of babys first and fav... |
| Very cute interactive book... |
| Beautiful book I love it t... |
| Try this out for a spring ... |
| very nice Divine Mercy Pen... |
| We bought the pins as my 6... |
| It has been many years sin... |
+-------------------------------+
[53072 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Split data into training and validation sets

We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the validation set. We use seed=1 so that everyone gets the same result.

In [7]:
glbObsFit, glbObsOOB = glbObsAll.random_split(.9, seed=1)

print 'Training set  : %d data points' % len(glbObsFit)
print 'Validation set: %d data points' % len(glbObsOOB)
Training set  : 47780 data points
Validation set: 5292 data points

Convert SFrame to NumPy array

Just like in the earlier assignments, we provide you with a function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels.

Note: The feature matrix includes an additional column '.intercept' filled with 1's to take account of the intercept term.

In [8]:
import numpy as np

def toNdarray(data_sframe, features, label):
    data_sframe['.intercept'] = 1
    features = ['.intercept'] + features
    features_sframe = data_sframe[features]
    ftrMtrx = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return(ftrMtrx, label_array)

Note that we convert both the training and validation sets into NumPy arrays.

Warning: This may take a few minutes.

In [9]:
glbObsFitFtrMtrx, glbObsFitRsp = toNdarray(glbObsFit, impWrd, 'sentiment')
glbObsOOBFtrMtrx, glbObsOOBRsp = toNdarray(glbObsOOB, impWrd, 'sentiment') 

Are you running this notebook on an Amazon EC2 t2.micro instance? (If you are using your own machine, please skip this section)

It has been reported that t2.micro instances do not provide sufficient power to complete the conversion in acceptable amount of time. For interest of time, please refrain from running toNdarray function. Instead, download the binary file containing the four NumPy arrays you'll need for the assignment. To load the arrays, run the following commands:

arrays = np.load('module-10-assignment-numpy-arrays.npz')
glbObsFitFtrMtrx, glbObsFitRsp = arrays['glbObsFitFtrMtrx'], arrays['glbObsFitRsp']
glbObsOOBFtrMtrx, glbObsOOBRsp = arrays['glbObsOOBFtrMtrx'], arrays['glbObsOOBRsp']

Quiz question: In Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does the changing the solver to stochastic gradient ascent affect the number of features?

Building on logistic regression

Let us now build on Module 3 assignment. Recall from lecture that the link function for logistic regression can be defined as:

$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$

where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of impWrd in the review $\mathbf{x}_i$.

We will use the same code as in Module 3 assignment to make probability predictions, since this part is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.

In [10]:
'''
produces probablistic estimate for P(y_i = +1 | x_i, w).
estimate ranges between 0 and 1.
'''
def predctProbability(ftrMtrx, coefficients):
    # Take dot product of ftrMtrx and coefficients  
    score = np.dot(ftrMtrx, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1. / (1.+np.exp(-score))    
    return predictions

Derivative of log likelihood with respect to a single coefficient

Let us now work on making minor changes to how the derivative computation is performed for logistic regression.

Recall from the lectures and Module 3 assignment that for logistic regression, the derivative of log likelihood with respect to a single coefficient is as follows:

$$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$

In Module 3 assignment, we wrote a function to compute the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts the following two parameters:

  • errors vector containing $(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w}))$ for all $i$
  • feature vector containing $h_j(\mathbf{x}_i)$ for all $i$

Complete the following code block:

In [11]:
def getFeatureDerivative(errors, feature): 
    
    # Compute the dot product of errors and feature
    ## YOUR CODE HERE
    derivative = np.dot(errors, feature)

    return derivative

Note. We are not using regularization in this assignment, but, as discussed in the optional video, stochastic gradient can also be used for regularized logistic regression.

To verify the correctness of the gradient computation, we provide a function for computing average log likelihood (which we recall from the last assignment was a topic detailed in an advanced optional video, and used here for its numerical stability).

To track the performance of stochastic gradient ascent, we provide a function for computing average log likelihood.

$$\ell\ell_A(\mathbf{w}) = \color{red}{\frac{1}{N}} \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

Note that we made one tiny modification to the log likelihood function (called compute_log_likelihood) in our earlier assignments. We added a $\color{red}{1/N}$ term which averages the log likelihood accross all data points. The $\color{red}{1/N}$ term makes it easier for us to compare stochastic gradient ascent with batch gradient ascent. We will use this function to generate plots that are similar to those you saw in the lecture.

In [12]:
def getAvgLogLikelihood(ftrMtrx, rspVctr, coefficients):
    
    indicator = (rspVctr==+1)
    scores = np.dot(ftrMtrx, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    
    lp = np.sum((indicator-1)*scores - logexp)/len(ftrMtrx)
    
    return lp

Quiz Question: Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

How are the functions $\ell\ell(\mathbf{w})$ and $\ell\ell_A(\mathbf{w})$ related?

Modifying the derivative for stochastic gradient ascent

Recall from the lecture that the gradient for a single data point $\color{red}{\mathbf{x}_i}$ can be computed using the following formula:

$$ \frac{\partial\ell_{\color{red}{i}}(\mathbf{w})}{\partial w_j} = h_j(\color{red}{\mathbf{x}_i})\left(\mathbf{1}[y_\color{red}{i} = +1] - P(y_\color{red}{i} = +1 | \color{red}{\mathbf{x}_i}, \mathbf{w})\right) $$

Computing the gradient for a single data point

Do we really need to re-write all our code to modify $\partial\ell(\mathbf{w})/\partial w_j$ to $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$?

Thankfully No!. Using NumPy, we access $\mathbf{x}_i$ in the training data using glbObsFitFtrMtrx[i:i+1,:] and $y_i$ in the training data using glbObsFitRsp[i:i+1]. We can compute $\partial\ell_{\color{red}{i}}(\mathbf{w})/\partial w_j$ by re-using all the code written in getFeatureDerivative and predctProbability.

We compute $\partial\ell_{\color{red}{i}}(\mathbf{w})/\partial w_j$ using the following steps:

  • First, compute $P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ using the predctProbability function with glbObsFitFtrMtrx[i:i+1,:] as the first parameter.
  • Next, compute $\mathbf{1}[y_i = +1]$ using glbObsFitRsp[i:i+1].
  • Finally, call the getFeatureDerivative function with glbObsFitFtrMtrx[i:i+1, j] as one of the parameters.

Let us follow these steps for j = 1 and i = 10:

In [13]:
j = 1                        # Feature number
i = 10                       # Data point number
coefficients = np.zeros(194) # A point w at which we are computing the gradient.

predictions = predctProbability(glbObsFitFtrMtrx[i:i+1,:], coefficients)
indicator = (glbObsFitRsp[i:i+1]==+1)

errors = indicator - predictions        
gradient_single_data_point = getFeatureDerivative(errors, glbObsFitFtrMtrx[i:i+1,j])
print "Gradient single data point: %s" % gradient_single_data_point
print "           --> Should print 0.0"
Gradient single data point: 0.0
           --> Should print 0.0

Quiz Question: The code block above computed $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$ for j = 1 and i = 10. Is $\partial\ell_{\color{red}{i}}(\mathbf{w})/{\partial w_j}$ a scalar or a 194-dimensional vector?

Modifying the derivative for using a batch of data points

Stochastic gradient estimates the ascent direction using 1 data point, while gradient uses $N$ data points to decide how to update the the parameters. In an optional video, we discussed the details of a simple change that allows us to use a mini-batch of $B \leq N$ data points to estimate the ascent direction. This simple approach is faster than regular gradient but less noisy than stochastic gradient that uses only 1 data point. Although we encorage you to watch the optional video on the topic to better understand why mini-batches help stochastic gradient, in this assignment, we will simply use this technique, since the approach is very simple and will improve your results.

Given a mini-batch (or a set of data points) $\mathbf{x}_{i}, \mathbf{x}_{i+1} \ldots \mathbf{x}_{i+B}$, the gradient function for this mini-batch of data points is given by: $$ \color{red}{\sum_{s = i}^{i+B}} \frac{\partial\ell_{s}}{\partial w_j} = \color{red}{\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) $$

Computing the gradient for a "mini-batch" of data points

Using NumPy, we access the points $\mathbf{x}_i, \mathbf{x}_{i+1} \ldots \mathbf{x}_{i+B}$ in the training data using glbObsFitFtrMtrx[i:i+B,:] and $y_i$ in the training data using glbObsFitRsp[i:i+B].

We can compute $\color{red}{\sum_{s = i}^{i+B}} \partial\ell_{s}/\partial w_j$ easily as follows:

In [14]:
j = 1                        # Feature number
i = 10                       # Data point start
B = 10                       # Mini-batch size
coefficients = np.zeros(194) # A point w at which we are computing the gradient.

predictions = predctProbability(glbObsFitFtrMtrx[i:i+B,:], coefficients)
indicator = (glbObsFitRsp[i:i+B]==+1)

errors = indicator - predictions        
gradient_mini_batch = getFeatureDerivative(errors, glbObsFitFtrMtrx[i:i+B,j])
print "Gradient mini-batch data points: %s" % gradient_mini_batch
print "                --> Should print 1.0"
Gradient mini-batch data points: 1.0
                --> Should print 1.0

Quiz Question: The code block above computed $\color{red}{\sum_{s = i}^{i+B}}\partial\ell_{s}(\mathbf{w})/{\partial w_j}$ for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?

Quiz Question: For what value of B is the term $\color{red}{\sum_{s = 1}^{B}}\partial\ell_{s}(\mathbf{w})/\partial w_j$ the same as the full gradient $\partial\ell(\mathbf{w})/{\partial w_j}$?

Averaging the gradient across a batch

It is a common practice to normalize the gradient update rule by the batch size B:

$$ \frac{\partial\ell_{\color{red}{A}}(\mathbf{w})}{\partial w_j} \approx \color{red}{\frac{1}{B}} {\sum_{s = i}^{i + B}} h_j(\mathbf{x}_s)\left(\mathbf{1}[y_s = +1] - P(y_s = +1 | \mathbf{x}_s, \mathbf{w})\right) $$

In other words, we update the coefficients using the average gradient over data points (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of all the data points), and study the effect of batch size on the algorithm as well as the choice of step size.

Implementing stochastic gradient ascent

Now we are ready to implement our own logistic regression with stochastic gradient ascent. Complete the following function to fit a logistic regression model using gradient ascent:

In [23]:
from math import sqrt
def fitLogisticRSGMdl(ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter):
    logLikelihoodVctr = []
    
    # make sure it's a numpy array
    coefficients = np.array(initCff)
    # set seed=1 to produce consistent results
    np.random.seed(seed=1)
    # Shuffle the data before starting
    permutation = np.random.permutation(len(ftrMtrx))
    ftrMtrx = ftrMtrx[permutation,:]
    rspVctr = rspVctr[permutation]
    
    i = 0 # index of current batch
    # Do a linear scan over data
    for thsIter in xrange(maxIter):
        # Predict P(y_i = +1|x_i,w) using your predctProbability() function
        # Make sure to slice the i-th row of ftrMtrx with [i:i+batchSize,:]
        ### YOUR CODE HERE
        predictions = predctProbability(ftrMtrx[i:i+batchSize,:], coefficients)
        
        # Compute indicator value for (y_i = +1)
        # Make sure to slice the i-th entry with [i:i+batchSize]
        ### YOUR CODE HERE
        indicator = (rspVctr[i:i+batchSize] == +1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            # Recall that ftrMtrx[:,j] is the feature column associated with coefficients[j]
            # Compute the derivative for coefficients[j] and save it to derivative.
            # Make sure to slice the i-th row of ftrMtrx with [i:i+batchSize,j]
            ### YOUR CODE HERE
            derivative = getFeatureDerivative(errors, ftrMtrx[i:i+batchSize,j])
            
            # compute the product of the step size, the derivative, and the **normalization constant** (1./batchSize)
            ### YOUR CODE HERE
            coefficients[j] += stepSize * derivative / batchSize
        
        # Checking whether log likelihood is increasing
        # Print the log likelihood over the *current batch*
        lp = getAvgLogLikelihood(ftrMtrx[i:i+batchSize,:], rspVctr[i:i+batchSize],
                                        coefficients)
        logLikelihoodVctr.append(lp)
        
        if  (thsIter <=    10)                          or \
            (thsIter <=   100 and thsIter %    10 == 0) or \
            (thsIter <=  1000 and thsIter %   100 == 0) or \
            (thsIter <= 10000 and thsIter %  1000 == 0) or \
                                 (thsIter % 10000 == 0) or \
            (thsIter == maxIter-1):
            nObs = len(ftrMtrx)
            print 'Iteration %*d: Average log likelihood (of data points in batch [%0*d:%0*d]) = %.8f' % \
                (int(np.ceil(np.log10(maxIter))), thsIter, \
                 int(np.ceil(np.log10(nObs))), i, \
                 int(np.ceil(np.log10(nObs))), i+batchSize, lp)
        
        # if we made a complete pass over data, shuffle and restart
        i += batchSize
        if i+batchSize > len(ftrMtrx):
            permutation = np.random.permutation(len(ftrMtrx))
            ftrMtrx = ftrMtrx[permutation,:]
            rspVctr = rspVctr[permutation]
            i = 0
                
    # We return the list of log likelihoods for plotting purposes.
    return coefficients, logLikelihoodVctr

Note. In practice, the final set of coefficients is rarely used; it is better to use the average of the last K sets of coefficients instead, where K should be adjusted depending on how fast the log likelihood oscillates around the optimum.

Checkpoint

The following cell tests your stochastic gradient ascent function using a toy dataset consisting of two data points. If the test does not pass, make sure you are normalizing the gradient update rule correctly.

In [24]:
sample_ftrMtrx = np.array([[1.,2.,-1.], [1.,0.,1.]])
sample_rspVctr = np.array([+1, -1])

coefficients, log_likelihood = fitLogisticRSGMdl(sample_ftrMtrx, sample_rspVctr, np.zeros(3),
                                                  stepSize=1., batchSize=2, maxIter=2)
print '-------------------------------------------------------------------------------------'
print 'Coefficients learned                 :', coefficients
print 'Average log likelihood per-iteration :', log_likelihood
if np.allclose(coefficients, np.array([-0.09755757,  0.68242552, -0.7799831]), atol=1e-3)\
  and np.allclose(log_likelihood, np.array([-0.33774513108142956, -0.2345530939410341])):
    # pass if elements match within 1e-3
    print '-------------------------------------------------------------------------------------'
    print 'Test passed!'
else:
    print '-------------------------------------------------------------------------------------'
    print 'Test failed'
Iteration 0: Average log likelihood (of data points in batch [0:2]) = -0.33774513
Iteration 1: Average log likelihood (of data points in batch [0:2]) = -0.23455309
-------------------------------------------------------------------------------------
Coefficients learned                 : [-0.09755757  0.68242552 -0.7799831 ]
Average log likelihood per-iteration : [-0.33774513108142956, -0.2345530939410341]
-------------------------------------------------------------------------------------
Test passed!

Compare convergence behavior of stochastic gradient ascent

For the remainder of the assignment, we will compare stochastic gradient ascent against batch gradient ascent. For this, we need a reference implementation of batch gradient ascent. But do we need to implement this from scratch?

Quiz Question: For what value of batch size B above is the stochastic gradient ascent function fitLogisticRSGMdl act as a standard gradient ascent algorithm?

Running gradient ascent using the stochastic gradient ascent implementation

In [68]:
glbMdlDf = pd.DataFrame()
In [67]:
# fitLogisticRSGMdl should be defined as LogisticRSG ??? & myfitMdl shd be fitLogisticRSGMdl ???
# try dicts for thsMdlDf['cff'] & thsMdlDf['logLhood']
def myfitMdl(mdlDf, ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter):
    import datetime
    
    if (mdlDf.shape[0] == 0) or \
       (mdlDf[(mdlDf.stepSize  ==  stepSize) &
                 (mdlDf.batchSize == batchSize) &
                 (mdlDf.maxIter   ==   maxIter) &
                 True].shape[0] == 0):
        thsMdlDf = pd.DataFrame({
                'stepSize' : stepSize ,
                'batchSize': batchSize,
                'maxIter'  : maxIter  
                                }, index = [0])
        thsMdlDf.set_index(['stepSize', 'batchSize', 'maxIter'], drop = False, inplace = True)
        #set_index(self, keys, drop=True, append=False, inplace=False, verify_integrity=False)
        startTm = datetime.datetime.now()
        print('\nrunning fitLogisticRSGMdl: stepSize: %.4e; batchSize: %d; maxIter: %d' %
             (stepSize, batchSize, maxIter))
        mdlCff, mdlLogLhood = \
            fitLogisticRSGMdl(ftrMtrx, rspVctr, initCff, stepSize, batchSize, maxIter)
        thsMdlDf['cff'] = mdlCff.tostring()
        thsMdlDf['logLhood'] = str(mdlLogLhood)
        thsMdlDf['elapsedSecs'] = (datetime.datetime.now() - startTm).seconds
        #glbMdlDf = glbMdlDf.append(mdlDf)
        return(thsMdlDf)

Instead of implementing batch gradient ascent separately, we save time by re-using the stochastic gradient ascent function we just wrote — to perform gradient ascent, it suffices to set batchSize to the number of data points in the training data. Yes, we did answer above the quiz question for you, but that is an important point to remember in the future :)

Small Caveat. The batch gradient ascent implementation here is slightly different than the one in the earlier assignments, as we now normalize the gradient update rule.

We now run stochastic gradient ascent over the glbObsFitFtrMtrx for 10 iterations using:

  • initCff = np.zeros(194)
  • stepSize = 5e-1
  • batchSize = 1
  • maxIter = 10
In [28]:
smpCff, smpLogLhood = fitLogisticRSGMdl(glbObsFitFtrMtrx, glbObsFitRsp,
                                        initCff=np.zeros(194),
                                        stepSize=5e-1, batchSize=1, maxIter=10)
Iteration 0: Average log likelihood (of data points in batch [00000:00001]) = -0.25192908
Iteration 1: Average log likelihood (of data points in batch [00001:00002]) = -0.00000001
Iteration 2: Average log likelihood (of data points in batch [00002:00003]) = -0.12692771
Iteration 3: Average log likelihood (of data points in batch [00003:00004]) = -0.02969101
Iteration 4: Average log likelihood (of data points in batch [00004:00005]) = -0.02668819
Iteration 5: Average log likelihood (of data points in batch [00005:00006]) = -0.04332901
Iteration 6: Average log likelihood (of data points in batch [00006:00007]) = -0.02368802
Iteration 7: Average log likelihood (of data points in batch [00007:00008]) = -0.12686897
Iteration 8: Average log likelihood (of data points in batch [00008:00009]) = -0.04468879
Iteration 9: Average log likelihood (of data points in batch [00009:00010]) = -0.00000124

Quiz Question. When you set batchSize = 1, as each iteration passes, how does the average log likelihood in the batch change?

  • Increases
  • Decreases
  • Fluctuates
In [31]:
print(type(smpCff))
#print(smpCff.tostring())
print(type(smpLogLhood))
print(str(smpLogLhood))
<type 'numpy.ndarray'>
<type 'list'>
[-0.25192908134537295, -8.5024389389541005e-09, -0.12692771260639515, -0.029691014785741033, -0.026688191711023244, -0.043329008764204917, -0.023688017825477341, -0.12686897415874313, -0.044688785447222955, -1.2350307603270494e-06]
In [69]:
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
                                        initCff=np.zeros(194),
                                        stepSize=5e-1, batchSize=1, maxIter=10))
print(glbMdlDf)
running fitLogisticRSGMdl: stepSize: 5.0000e-01; batchSize: 1; maxIter: 10
Iteration 0: Average log likelihood (of data points in batch [00000:00001]) = -0.25192908
Iteration 1: Average log likelihood (of data points in batch [00001:00002]) = -0.00000001
Iteration 2: Average log likelihood (of data points in batch [00002:00003]) = -0.12692771
Iteration 3: Average log likelihood (of data points in batch [00003:00004]) = -0.02969101
Iteration 4: Average log likelihood (of data points in batch [00004:00005]) = -0.02668819
Iteration 5: Average log likelihood (of data points in batch [00005:00006]) = -0.04332901
Iteration 6: Average log likelihood (of data points in batch [00006:00007]) = -0.02368802
Iteration 7: Average log likelihood (of data points in batch [00007:00008]) = -0.12686897
Iteration 8: Average log likelihood (of data points in batch [00008:00009]) = -0.04468879
Iteration 9: Average log likelihood (of data points in batch [00009:00010]) = -0.00000124
                            batchSize  maxIter  stepSize  \
stepSize batchSize maxIter                                 
0.5      1         10               1       10       0.5   

                                                                          cff  \
stepSize batchSize maxIter                                                      
0.5      1         10       �R-\ro.�?@֑Y�6�?Ky�鿸\����?XoqU��?|t��r���...   

                                                                     logLhood  \
stepSize batchSize maxIter                                                      
0.5      1         10       [-0.25192908134537295, -8.5024389389541005e-09...   

                            elapsedSecs  
stepSize batchSize maxIter               
0.5      1         10                 0  

Now run batch gradient ascent over the glbObsFitFtrMtrx for 200 iterations using:

  • initCff = np.zeros(194)
  • stepSize = 5e-1
  • batchSize = len(glbObsFitFtrMtrx)
  • maxIter = 200
In [71]:
# YOUR CODE HERE
#coefficients_batch, log_likelihood_batch = ...
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
                                        initCff = np.zeros(194),
                            stepSize = 5e-1, batchSize = len(glbObsFitFtrMtrx), maxIter = 200))
print(glbMdlDf)
running fitLogisticRSGMdl: stepSize: 5.0000e-01; batchSize: 47780; maxIter: 200
Iteration   0: Average log likelihood (of data points in batch [00000:47780]) = -0.68308119
Iteration   1: Average log likelihood (of data points in batch [00000:47780]) = -0.67394599
Iteration   2: Average log likelihood (of data points in batch [00000:47780]) = -0.66555129
Iteration   3: Average log likelihood (of data points in batch [00000:47780]) = -0.65779626
Iteration   4: Average log likelihood (of data points in batch [00000:47780]) = -0.65060701
Iteration   5: Average log likelihood (of data points in batch [00000:47780]) = -0.64392241
Iteration   6: Average log likelihood (of data points in batch [00000:47780]) = -0.63769009
Iteration   7: Average log likelihood (of data points in batch [00000:47780]) = -0.63186462
Iteration   8: Average log likelihood (of data points in batch [00000:47780]) = -0.62640636
Iteration   9: Average log likelihood (of data points in batch [00000:47780]) = -0.62128063
Iteration  10: Average log likelihood (of data points in batch [00000:47780]) = -0.61645691
Iteration  20: Average log likelihood (of data points in batch [00000:47780]) = -0.58016048
Iteration  30: Average log likelihood (of data points in batch [00000:47780]) = -0.55694679
Iteration  40: Average log likelihood (of data points in batch [00000:47780]) = -0.54062740
Iteration  50: Average log likelihood (of data points in batch [00000:47780]) = -0.52843284
Iteration  60: Average log likelihood (of data points in batch [00000:47780]) = -0.51892645
Iteration  70: Average log likelihood (of data points in batch [00000:47780]) = -0.51128179
Iteration  80: Average log likelihood (of data points in batch [00000:47780]) = -0.50498629
Iteration  90: Average log likelihood (of data points in batch [00000:47780]) = -0.49970336
Iteration 100: Average log likelihood (of data points in batch [00000:47780]) = -0.49520194
Iteration 199: Average log likelihood (of data points in batch [00000:47780]) = -0.47126953
                            batchSize  maxIter  stepSize  \
stepSize batchSize maxIter                                 
0.5      1         10               1       10       0.5   
         47780     200          47780      200       0.5   

                                                                          cff  \
stepSize batchSize maxIter                                                      
0.5      1         10       �R-\ro.�?@֑Y�6�?Ky�鿸\����?XoqU��?|t��r���...   
         47780     200      �p:-_p��G��÷B�?�B>���?\r�X)�v�?��Cu��?��...   

                                                                     logLhood  \
stepSize batchSize maxIter                                                      
0.5      1         10       [-0.25192908134537295, -8.5024389389541005e-09...   
         47780     200      [-0.6830811890078895, -0.67394598807986328, -0...   

                            elapsedSecs  
stepSize batchSize maxIter               
0.5      1         10                 0  
         47780     200               48  

Quiz Question. When you set batchSize = len(glbObsFit), as each iteration passes, how does the average log likelihood in the batch change?

  • Increases
  • Decreases
  • Fluctuates

Make "passes" over the dataset

To make a fair comparison betweeen stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes (defined as follows): $$ [\text{# of passes}] = \frac{[\text{# of data points touched so far}]}{[\text{size of dataset}]} $$

Quiz Question Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?

In [72]:
print(2. * 50000 / 100)
1000.0

Log likelihood plots for stochastic gradient ascent

With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use

  • stepSize=1e-1
  • batchSize=100
  • initCff to all zeros.
In [74]:
stepSize = 1e-1
batchSize = 100
num_passes = 10
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)

# coefficients_sgd, log_likelihood_sgd = fitLogisticRSGMdl(glbObsFitFtrMtrx, glbObsFitRsp,
#                                        initCff=np.zeros(194),
#                                        stepSize=1e-1, batchSize=100, maxIter=num_iterations)
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
                                       initCff=np.zeros(194),
                                       stepSize=1e-1, batchSize=100, maxIter=num_iterations))
print(glbMdlDf)
running fitLogisticRSGMdl: stepSize: 1.0000e-01; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68251093
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.67845294
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.68207160
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.67411325
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67804438
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67712546
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.66377074
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67321231
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.66923613
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.67479446
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66501639
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.64518234
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -0.64559035
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.63273134
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.62736525
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -0.60362889
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.60002448
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.58804922
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.56693210
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.60976663
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.54566060
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.48245740
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.46629313
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.47223389
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.52216798
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.52336683
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.46963453
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.47883783
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.46988191
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46365531
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.36466901
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.51096892
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.54670667
                            batchSize  maxIter  stepSize  \
stepSize batchSize maxIter                                 
0.5      1         10               1       10       0.5   
         47780     200          47780      200       0.5   
0.1      100       4770           100     4770       0.1   

                                                                          cff  \
stepSize batchSize maxIter                                                      
0.5      1         10       �R-\ro.�?@֑Y�6�?Ky�鿸\����?XoqU��?|t��r���...   
         47780     200      �p:-_p��G��÷B�?�B>���?\r�X)�v�?��Cu��?��...   
0.1      100       4770     f�=�˜���]y���?\O[2��?�}�f���?�ɬݥ%�?ov��Ivr�P...   

                                                                     logLhood  \
stepSize batchSize maxIter                                                      
0.5      1         10       [-0.25192908134537295, -8.5024389389541005e-09...   
         47780     200      [-0.6830811890078895, -0.67394598807986328, -0...   
0.1      100       4770     [-0.68251092856556295, -0.67845294380099008, -...   

                            elapsedSecs  
stepSize batchSize maxIter               
0.5      1         10                 0  
         47780     200               48  
0.1      100       4770               6  

We provide you with a utility function to plot the average log likelihood as a function of the number of passes.

In [75]:
import matplotlib.pyplot as plt
%matplotlib inline

def pltLogLhood(logLikelihoodVctr, len_data, batchSize, smoothing_window=1, label=''):
    plt.rcParams.update({'figure.figsize': (9,5)})
    logLikelihoodVctr_ma = np.convolve(np.array(logLikelihoodVctr), \
                                        np.ones((smoothing_window,))/smoothing_window, mode='valid')
    plt.plot(np.array(range(smoothing_window-1, len(logLikelihoodVctr)))*float(batchSize)/len_data,
             logLikelihoodVctr_ma, linewidth=4.0, label=label)
    plt.rcParams.update({'font.size': 16})
    plt.tight_layout()
    plt.xlabel('# of passes over data')
    plt.ylabel('Average log likelihood per data point')
    plt.legend(loc='lower right', prop={'size':14})
In [80]:
#[float(x) for x in tstStr.strip("[]").split(',')]
pltLogLhood([float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == 1e-1) &
                              (glbMdlDf.batchSize == 100 ) &
                              (glbMdlDf.maxIter   == 4770) 
                             ]['logLhood'][0].strip("[]").split(',')], 
            len_data=len(glbObsFitFtrMtrx), batchSize=100,
              label='stochastic gradient, stepSize=1e-1')

Smoothing the stochastic gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter smoothing_window. The smoothing is simply a moving average of log likelihood over the last smoothing_window "iterations" of stochastic gradient ascent.

In [81]:
pltLogLhood([float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == 1e-1) &
                              (glbMdlDf.batchSize == 100 ) &
                              (glbMdlDf.maxIter   == 4770) 
                             ]['logLhood'][0].strip("[]").split(',')], 
            len_data=len(glbObsFitFtrMtrx), batchSize=100,
              smoothing_window=30,  
              label='stochastic gradient, stepSize=1e-1')
# pltLogLhood(log_likelihood_sgd, len_data=len(glbObsFitFtrMtrx), batchSize=100,
#           smoothing_window=30, label='stochastic gradient, stepSize=1e-1')

Checkpoint: The above plot should look smoother than the previous plot. Play around with smoothing_window. As you increase it, you should see a smoother plot.

Stochastic gradient ascent vs batch gradient ascent

To compare convergence rates for stochastic gradient ascent with batch gradient ascent, we call pltLogLhood() multiple times in the same cell.

We are comparing:

  • stochastic gradient ascent: stepSize = 0.1, batchSize=100
  • batch gradient ascent: stepSize = 0.5, batchSize=len(glbObsFitFtrMtrx)

Write code to run stochastic gradient ascent for 200 passes using:

  • stepSize=1e-1
  • batchSize=100
  • initCff to all zeros.
In [82]:
savMdlDf = glbMdlDf
In [83]:
stepSize = 1e-1
batchSize = 100
num_passes = 200
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)

## YOUR CODE HERE
#coefficients_sgd, log_likelihood_sgd = ...
glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
                                       initCff=np.zeros(194),
                                       stepSize=1e-1, batchSize=100, maxIter=num_iterations))
print(glbMdlDf)
running fitLogisticRSGMdl: stepSize: 1.0000e-01; batchSize: 100; maxIter: 95400
Iteration     0: Average log likelihood (of data points in batch [00000:00100]) = -0.68251093
Iteration     1: Average log likelihood (of data points in batch [00100:00200]) = -0.67845294
Iteration     2: Average log likelihood (of data points in batch [00200:00300]) = -0.68207160
Iteration     3: Average log likelihood (of data points in batch [00300:00400]) = -0.67411325
Iteration     4: Average log likelihood (of data points in batch [00400:00500]) = -0.67804438
Iteration     5: Average log likelihood (of data points in batch [00500:00600]) = -0.67712546
Iteration     6: Average log likelihood (of data points in batch [00600:00700]) = -0.66377074
Iteration     7: Average log likelihood (of data points in batch [00700:00800]) = -0.67321231
Iteration     8: Average log likelihood (of data points in batch [00800:00900]) = -0.66923613
Iteration     9: Average log likelihood (of data points in batch [00900:01000]) = -0.67479446
Iteration    10: Average log likelihood (of data points in batch [01000:01100]) = -0.66501639
Iteration    20: Average log likelihood (of data points in batch [02000:02100]) = -0.64518234
Iteration    30: Average log likelihood (of data points in batch [03000:03100]) = -0.64559035
Iteration    40: Average log likelihood (of data points in batch [04000:04100]) = -0.63273134
Iteration    50: Average log likelihood (of data points in batch [05000:05100]) = -0.62736525
Iteration    60: Average log likelihood (of data points in batch [06000:06100]) = -0.60362889
Iteration    70: Average log likelihood (of data points in batch [07000:07100]) = -0.60002448
Iteration    80: Average log likelihood (of data points in batch [08000:08100]) = -0.58804922
Iteration    90: Average log likelihood (of data points in batch [09000:09100]) = -0.56693210
Iteration   100: Average log likelihood (of data points in batch [10000:10100]) = -0.60976663
Iteration   200: Average log likelihood (of data points in batch [20000:20100]) = -0.54566060
Iteration   300: Average log likelihood (of data points in batch [30000:30100]) = -0.48245740
Iteration   400: Average log likelihood (of data points in batch [40000:40100]) = -0.46629313
Iteration   500: Average log likelihood (of data points in batch [02300:02400]) = -0.47223389
Iteration   600: Average log likelihood (of data points in batch [12300:12400]) = -0.52216798
Iteration   700: Average log likelihood (of data points in batch [22300:22400]) = -0.52336683
Iteration   800: Average log likelihood (of data points in batch [32300:32400]) = -0.46963453
Iteration   900: Average log likelihood (of data points in batch [42300:42400]) = -0.47883783
Iteration  1000: Average log likelihood (of data points in batch [04600:04700]) = -0.46988191
Iteration  2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46365531
Iteration  3000: Average log likelihood (of data points in batch [13800:13900]) = -0.36466901
Iteration  4000: Average log likelihood (of data points in batch [18400:18500]) = -0.51096892
Iteration  5000: Average log likelihood (of data points in batch [23000:23100]) = -0.43544394
Iteration  6000: Average log likelihood (of data points in batch [27600:27700]) = -0.45656653
Iteration  7000: Average log likelihood (of data points in batch [32200:32300]) = -0.42656766
Iteration  8000: Average log likelihood (of data points in batch [36800:36900]) = -0.39989352
Iteration  9000: Average log likelihood (of data points in batch [41400:41500]) = -0.45267388
Iteration 10000: Average log likelihood (of data points in batch [46000:46100]) = -0.45394262
Iteration 20000: Average log likelihood (of data points in batch [44300:44400]) = -0.48958438
Iteration 30000: Average log likelihood (of data points in batch [42600:42700]) = -0.41913672
Iteration 40000: Average log likelihood (of data points in batch [40900:41000]) = -0.45899229
Iteration 50000: Average log likelihood (of data points in batch [39200:39300]) = -0.46859254
Iteration 60000: Average log likelihood (of data points in batch [37500:37600]) = -0.41599369
Iteration 70000: Average log likelihood (of data points in batch [35800:35900]) = -0.49905981
Iteration 80000: Average log likelihood (of data points in batch [34100:34200]) = -0.45494095
Iteration 90000: Average log likelihood (of data points in batch [32400:32500]) = -0.43220080
Iteration 95399: Average log likelihood (of data points in batch [47600:47700]) = -0.50265709
                            batchSize  maxIter  stepSize  \
stepSize batchSize maxIter                                 
0.5      1         10               1       10       0.5   
         47780     200          47780      200       0.5   
0.1      100       4770           100     4770       0.1   
                   95400          100    95400       0.1   

                                                                          cff  \
stepSize batchSize maxIter                                                      
0.5      1         10       �R-\ro.�?@֑Y�6�?Ky�鿸\����?XoqU��?|t��r���...   
         47780     200      �p:-_p��G��÷B�?�B>���?\r�X)�v�?��Cu��?��...   
0.1      100       4770     f�=�˜���]y���?\O[2��?�}�f���?�ɬݥ%�?ov��Ivr�P...   
                   95400     a��j����p0���?�i�0'�?�{����?)Spw��?�_�@JX?...   

                                                                     logLhood  \
stepSize batchSize maxIter                                                      
0.5      1         10       [-0.25192908134537295, -8.5024389389541005e-09...   
         47780     200      [-0.6830811890078895, -0.67394598807986328, -0...   
0.1      100       4770     [-0.68251092856556295, -0.67845294380099008, -...   
                   95400    [-0.68251092856556295, -0.67845294380099008, -...   

                            elapsedSecs  
stepSize batchSize maxIter               
0.5      1         10                 0  
         47780     200               48  
0.1      100       4770               6  
                   95400            126  

We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with smoothing_window=30.

In [87]:
sgdLogLhood = [float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == 1e-1) &
                              (glbMdlDf.batchSize == 100 ) &
                              (glbMdlDf.maxIter   == 95400) 
                             ]['logLhood'][0].strip("[]").split(',')]
print(sgdLogLhood[:5])
[-0.682510928565563, -0.6784529438009901, -0.6820716036260345, -0.6741132500973214, -0.6780443812596296]
In [85]:
btcLogLhood = [float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == 5e-1  ) &
                              (glbMdlDf.batchSize == 47780 ) &
                              (glbMdlDf.maxIter   == 200   ) 
                             ]['logLhood'][0].strip("[]").split(',')]
print(btcLogLhood[:5])
[-0.6830811890078895, -0.6739459880798633, -0.6655512906167012, -0.65779625861419, -0.6506070057710562]
In [86]:
pltLogLhood(sgdLogLhood, len_data=len(glbObsFitFtrMtrx), batchSize=100,
          smoothing_window=30, label='stochastic, stepSize=1e-1')
pltLogLhood(btcLogLhood, len_data=len(glbObsFitFtrMtrx), batchSize=len(glbObsFitFtrMtrx),
          smoothing_window=1, label='batch, stepSize=5e-1')

Quiz Question: In the figure above, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?

  1. It's always better
  2. 10 passes
  3. 20 passes
  4. 150 passes or more

Explore the effects of step sizes on stochastic gradient ascent

In previous sections, we chose step sizes for you. In practice, it helps to know how to choose good step sizes yourself.

To start, we explore a wide range of step sizes that are equally spaced in the log space. Run stochastic gradient ascent with stepSize set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2. Use the following set of parameters:

  • initCff=np.zeros(194)
  • batchSize=100
  • maxIter initialized so as to run 10 passes over the data.
In [88]:
savMdlDf = glbMdlDf
In [ ]:
# batchSize = 100
# num_passes = 10
# num_iterations = ...

# coefficients_sgd = {}
# log_likelihood_sgd = {}
# for stepSize in np.logspace(-4, 2, num=7):
#     coefficients_sgd[stepSize], log_likelihood_sgd[stepSize] = ...
In [89]:
batchSize = 100
num_passes = 10
num_iterations = num_passes * int(len(glbObsFitFtrMtrx)/batchSize)

for stepSize in np.logspace(-4, 2, num=7):
    glbMdlDf = glbMdlDf.append(myfitMdl(glbMdlDf, glbObsFitFtrMtrx, glbObsFitRsp,
                                       initCff=np.zeros(194),
                        stepSize = stepSize, batchSize = batchSize, maxIter=num_iterations))
    
print(glbMdlDf)
running fitLogisticRSGMdl: stepSize: 1.0000e-04; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69313622
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69313170
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69313585
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69312487
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69313157
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69313113
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69311121
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69312692
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69312115
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69312811
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69311286
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.69309901
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -0.69309240
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.69307082
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.69304682
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -0.69300736
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.69299979
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.69297210
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.69294970
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69298918
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.69277472
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.69228764
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.69222554
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.69186710
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.69230650
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.69174220
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.69139955
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.69123818
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.69088883
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.68976850
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.68569701
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.68597545
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.68736824

running fitLogisticRSGMdl: stepSize: 1.0000e-03; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69303759
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69299241
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69303389
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69292442
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69299113
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69298668
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69278828
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69294460
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69288708
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69295651
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69280480
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.69266501
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -0.69259969
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.69238498
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.69215059
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -0.69175643
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.69168304
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.69141178
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.69117504
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69158805
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.68946852
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.68492418
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.68415366
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.68114554
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.68489867
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.68027821
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.67693088
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.67561867
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.67367588
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.66156206
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.62798175
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.64157978
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.64571292

running fitLogisticRSGMdl: stepSize: 1.0000e-02; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69205420
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69160695
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69201686
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69095428
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69159348
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69154386
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.68964000
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69112685
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69056997
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69124730
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.68980179
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.68827963
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -0.68771080
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.68566280
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.68378499
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -0.67993013
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.67935758
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.67707023
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.67390961
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.67916061
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.66049079
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.63235099
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.62183600
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.61150928
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.62979300
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.61553432
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.59156014
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.58842264
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.59076267
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.54480104
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.45761063
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.54362587
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.56306510

running fitLogisticRSGMdl: stepSize: 1.0000e+00; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.61201447
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.58843678
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.59771677
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.58770466
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.56939710
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.57554451
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.54068090
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.55212916
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.55311029
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.57672007
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.55455807
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.48175763
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -0.50945147
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.49154317
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.48056425
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -0.45655749
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.43759667
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.40860207
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.41486032
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.50102061
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.42406927
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.35064478
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.38344116
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.40170047
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.45117863
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.46493371
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.45343350
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43128394
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.43169967
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.43029376
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.32703099
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.49162447
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.52452720

running fitLogisticRSGMdl: stepSize: 1.0000e+01; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.51319004
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -2.20035379
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -3.34199720
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -3.06285156
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -2.80822162
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -2.99629286
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -2.71489944
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -3.61713200
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -1.19526584
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.75357081
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.71310829
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -0.46458251
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -1.08273250
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -0.48897038
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -0.53930475
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -1.30521484
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -0.50927370
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -0.32503033
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -0.40436582
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -1.19013437
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.48968363
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.72860037
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.58719556
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.31220572
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -1.89468446
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.96096585
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.66616640
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.46114004
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.47236476
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.45227508
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.29378688
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -2.47834692
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -2.48776279

running fitLogisticRSGMdl: stepSize: 1.0000e+02; batchSize: 100; maxIter: 4770
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -2.44471310
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -36.66862050
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -25.49870239
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -40.14565040
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -27.03748522
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -32.62294582
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -25.88017915
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -37.30720216
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -10.87360529
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -6.60878996
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -7.15375088
Iteration   20: Average log likelihood (of data points in batch [02000:02100]) = -4.49577584
Iteration   30: Average log likelihood (of data points in batch [03000:03100]) = -9.69907670
Iteration   40: Average log likelihood (of data points in batch [04000:04100]) = -3.02162837
Iteration   50: Average log likelihood (of data points in batch [05000:05100]) = -6.83532823
Iteration   60: Average log likelihood (of data points in batch [06000:06100]) = -13.19072630
Iteration   70: Average log likelihood (of data points in batch [07000:07100]) = -1.49014891
Iteration   80: Average log likelihood (of data points in batch [08000:08100]) = -2.14678829
Iteration   90: Average log likelihood (of data points in batch [09000:09100]) = -13.37462833
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -10.64501703
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -4.03566678
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -4.52617181
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -5.15162255
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -4.71897542
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -20.09445071
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -10.30858950
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -4.33477503
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -2.36320107
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -3.06233274
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -2.83645630
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.77701793
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -30.17399169
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -28.61787372
                            batchSize  maxIter  stepSize  \
stepSize batchSize maxIter                                 
0.5000   1         10               1       10    0.5000   
         47780     200          47780      200    0.5000   
0.1000   100       4770           100     4770    0.1000   
                   95400          100    95400    0.1000   
0.0001   100       4770           100     4770    0.0001   
0.0010   100       4770           100     4770    0.0010   
0.0100   100       4770           100     4770    0.0100   
1.0000   100       4770           100     4770    1.0000   
10.0000  100       4770           100     4770   10.0000   
100.0000 100       4770           100     4770  100.0000   

                                                                          cff  \
stepSize batchSize maxIter                                                      
0.5000   1         10       �R-\ro.�?@֑Y�6�?Ky�鿸\����?XoqU��?|t��r���...   
         47780     200      �p:-_p��G��÷B�?�B>���?\r�X)�v�?��Cu��?��...   
0.1000   100       4770     f�=�˜���]y���?\O[2��?�}�f���?�ɬݥ%�?ov��Ivr�P...   
                   95400     a��j����p0���?�i�0'�?�{����?)Spw��?�_�@JX?...   
0.0001   100       4770     J�%�]/?���uS.r?{�:��n����?�+�mД?ؒ�G�>L?�8...   
0.0010   100       4770     I�A'v�?cڪ�3�?�_�A�~���=�{{�?Kܼ\��?ĭ}�h�?c7K...   
0.0100   100       4770     �g���P��E� ��?��52�t?���a���?�"Gm���?Med�K�?...   
1.0000   100       4770     췎lf�ÿ�"i�킿Ϋ�<�c�?��p���?��ڞ��?���<%���+�...   
10.0000  100       4770     F���Y��?3�t�0�?`/����?h\r�o�a@���z�@T��'...   
100.0000 100       4770     �Q�� �1@&��`0@�̫@�/@�g�P7@@e�Wv0@@��Y�H'@...   

                                                                     logLhood  \
stepSize batchSize maxIter                                                      
0.5000   1         10       [-0.25192908134537295, -8.5024389389541005e-09...   
         47780     200      [-0.6830811890078895, -0.67394598807986328, -0...   
0.1000   100       4770     [-0.68251092856556295, -0.67845294380099008, -...   
                   95400    [-0.68251092856556295, -0.67845294380099008, -...   
0.0001   100       4770     [-0.6931362183867823, -0.69313169639053884, -0...   
0.0010   100       4770     [-0.69303758824365824, -0.69299241358323871, -...   
0.0100   100       4770     [-0.69205419887263875, -0.69160694682691759, -...   
1.0000   100       4770     [-0.61201446796296499, -0.58843678479672146, -...   
10.0000  100       4770     [-0.51319004065115859, -2.2003537949091303, -3...   
100.0000 100       4770     [-2.4447130978468219, -36.668620504352454, -25...   

                            elapsedSecs  
stepSize batchSize maxIter               
0.5000   1         10                 0  
         47780     200               48  
0.1000   100       4770               6  
                   95400            126  
0.0001   100       4770               6  
0.0010   100       4770               6  
0.0100   100       4770               6  
1.0000   100       4770               6  
10.0000  100       4770               6  
100.0000 100       4770               6  

Plotting the log likelihood as a function of passes for each step size

Now, we will plot the change in log likelihood using the pltLogLhood for each of the following values of stepSize:

  • stepSize = 1e-4
  • stepSize = 1e-3
  • stepSize = 1e-2
  • stepSize = 1e-1
  • stepSize = 1e0
  • stepSize = 1e1
  • stepSize = 1e2

For consistency, we again apply smoothing_window=30.

In [90]:
for stepSize in np.logspace(-4, 2, num=7):
    sgdLogLhood = [float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == stepSize) &
                              (glbMdlDf.batchSize == 100 ) &
                              (glbMdlDf.maxIter   == 4770) 
                             ]['logLhood'][0].strip("[]").split(',')]
#     pltLogLhood(log_likelihood_sgd[stepSize], len_data=len(glbObsFit), batchSize=100,
#               smoothing_window=30, label='stepSize=%.1e'%stepSize)
    pltLogLhood(sgdLogLhood, len_data=len(glbObsFit), batchSize=100,
              smoothing_window=30, label='stepSize=%.1e'%stepSize)    

Now, let us remove the step size stepSize = 1e2 and plot the rest of the curves.

In [91]:
for stepSize in np.logspace(-4, 2, num=7)[0:6]:
    sgdLogLhood = [float(ll) for ll in 
                     glbMdlDf[(glbMdlDf.stepSize  == stepSize) &
                              (glbMdlDf.batchSize == 100 ) &
                              (glbMdlDf.maxIter   == 4770) 
                             ]['logLhood'][0].strip("[]").split(',')]
#     pltLogLhood(log_likelihood_sgd[stepSize], len_data=len(glbObsFit), batchSize=100,
#               smoothing_window=30, label='stepSize=%.1e'%stepSize)
    pltLogLhood(sgdLogLhood, len_data=len(glbObsFit), batchSize=100,
              smoothing_window=30, label='stepSize=%.1e'%stepSize)        

Quiz Question: Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.

  1. 1e-2
  2. 1e-1
  3. 1e0
  4. 1e1
  5. 1e2

Quiz Question: Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.

  1. 1e-4
  2. 1e-2
  3. 1e0
  4. 1e1
  5. 1e2
In [ ]: